Data Analytics (Machine Learning) - COMPAS Score¶

Project Dataset, Machine Learning Course.
Master Degree in Artificial Intelligence and Computer Science.
a.y. 2021/2022

Group 404 Name Not Found

  • Canonaco Martina [231874]
  • Gabriele Giada [235799]
  • Gena Davide [231873]
  • Morello Michele [223953]

COMPAS Recidivism Dataset¶

Business Understanding¶

Scenario¶

  • Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a case management and decision support tool developed and owned by Northpointe (now Equivant). U.S. courts use it to assess the likelihood that a defendant becomes a recidivist.
  • The dataset consists of several arrests logged in Broward County, Florida.
  • U.S. courts are interested in understanding which factors lead criminals to become recidivists.
  • The COMPAS dataset contains different features describing the demographics and criminal history of the defendants.

Goal¶

  • The goal is to predict whether or not a defendant reoffends after the arrest.

Import modules¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
plt.style.use('ggplot')

Read the Dataset¶

In [2]:
dt = pd.read_csv('compas-scores.csv', sep=',')
In [3]:
dt
Out[3]:
id name first last compas_screening_date sex dob age age_cat race ... vr_offense_date vr_charge_desc v_type_of_assessment v_decile_score v_score_text v_screening_date type_of_assessment decile_score.1 score_text screening_date
0 1 miguel hernandez miguel hernandez 2013-08-14 Male 1947-04-18 69 Greater than 45 Other ... NaN NaN Risk of Violence 1 Low 2013-08-14 Risk of Recidivism 1 Low 2013-08-14
1 2 michael ryan michael ryan 2014-12-31 Male 1985-02-06 31 25 - 45 Caucasian ... NaN NaN Risk of Violence 2 Low 2014-12-31 Risk of Recidivism 5 Medium 2014-12-31
2 3 kevon dixon kevon dixon 2013-01-27 Male 1982-01-22 34 25 - 45 African-American ... 2013-07-05 Felony Battery (Dom Strang) Risk of Violence 1 Low 2013-01-27 Risk of Recidivism 3 Low 2013-01-27
3 4 ed philo ed philo 2013-04-14 Male 1991-05-14 24 Less than 25 African-American ... NaN NaN Risk of Violence 3 Low 2013-04-14 Risk of Recidivism 4 Low 2013-04-14
4 5 marcu brown marcu brown 2013-01-13 Male 1993-01-21 23 Less than 25 African-American ... NaN NaN Risk of Violence 6 Medium 2013-01-13 Risk of Recidivism 8 High 2013-01-13
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11752 11753 patrick hamilton patrick hamilton 2013-09-22 Male 1968-05-02 47 Greater than 45 Other ... NaN NaN Risk of Violence 1 Low 2013-09-22 Risk of Recidivism 3 Low 2013-09-22
11753 11754 raymond hernandez raymond hernandez 2013-05-17 Male 1993-06-24 22 Less than 25 Caucasian ... NaN NaN Risk of Violence 5 Medium 2013-05-17 Risk of Recidivism 7 Medium 2013-05-17
11754 11755 dieuseul pierre-gilles dieuseul pierre-gilles 2014-10-08 Male 1981-01-24 35 25 - 45 Other ... NaN NaN Risk of Violence 3 Low 2014-10-08 Risk of Recidivism 4 Low 2014-10-08
11755 11756 scott lomagistro scott lomagistro 2013-12-03 Male 1986-12-04 29 25 - 45 Caucasian ... NaN NaN Risk of Violence 2 Low 2013-12-03 Risk of Recidivism 3 Low 2013-12-03
11756 11757 chin yan chin yan 2014-01-11 Male 1982-02-19 34 25 - 45 Asian ... NaN NaN Risk of Violence 1 Low 2014-01-11 Risk of Recidivism 1 Low 2014-01-11

11757 rows × 47 columns

Data Understanding¶

Data dimension (n_rows x n_columns)¶
In [4]:
dt.shape
Out[4]:
(11757, 47)
Number of elements in the dataset¶
In [5]:
dt.size
Out[5]:
552579
List of the attributes¶
In [6]:
print(dt.columns.tolist())
['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_days_from_compas', 'c_charge_degree', 'c_charge_desc', 'is_recid', 'num_r_cases', 'r_case_number', 'r_charge_degree', 'r_days_from_arrest', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'is_violent_recid', 'num_vr_cases', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'v_type_of_assessment', 'v_decile_score', 'v_score_text', 'v_screening_date', 'type_of_assessment', 'decile_score.1', 'score_text', 'screening_date']
Attributes Types¶
In [7]:
dt.dtypes
Out[7]:
id                           int64
name                        object
first                       object
last                        object
compas_screening_date       object
sex                         object
dob                         object
age                          int64
age_cat                     object
race                        object
juv_fel_count                int64
decile_score                 int64
juv_misd_count               int64
juv_other_count              int64
priors_count                 int64
days_b_screening_arrest    float64
c_jail_in                   object
c_jail_out                  object
c_case_number               object
c_offense_date              object
c_arrest_date               object
c_days_from_compas         float64
c_charge_degree             object
c_charge_desc               object
is_recid                     int64
num_r_cases                float64
r_case_number               object
r_charge_degree             object
r_days_from_arrest         float64
r_offense_date              object
r_charge_desc               object
r_jail_in                   object
r_jail_out                  object
is_violent_recid             int64
num_vr_cases               float64
vr_case_number              object
vr_charge_degree            object
vr_offense_date             object
vr_charge_desc              object
v_type_of_assessment        object
v_decile_score               int64
v_score_text                object
v_screening_date            object
type_of_assessment          object
decile_score.1               int64
score_text                  object
screening_date              object
dtype: object
First 10 records¶
In [8]:
dt.head(10)
Out[8]:
id name first last compas_screening_date sex dob age age_cat race ... vr_offense_date vr_charge_desc v_type_of_assessment v_decile_score v_score_text v_screening_date type_of_assessment decile_score.1 score_text screening_date
0 1 miguel hernandez miguel hernandez 2013-08-14 Male 1947-04-18 69 Greater than 45 Other ... NaN NaN Risk of Violence 1 Low 2013-08-14 Risk of Recidivism 1 Low 2013-08-14
1 2 michael ryan michael ryan 2014-12-31 Male 1985-02-06 31 25 - 45 Caucasian ... NaN NaN Risk of Violence 2 Low 2014-12-31 Risk of Recidivism 5 Medium 2014-12-31
2 3 kevon dixon kevon dixon 2013-01-27 Male 1982-01-22 34 25 - 45 African-American ... 2013-07-05 Felony Battery (Dom Strang) Risk of Violence 1 Low 2013-01-27 Risk of Recidivism 3 Low 2013-01-27
3 4 ed philo ed philo 2013-04-14 Male 1991-05-14 24 Less than 25 African-American ... NaN NaN Risk of Violence 3 Low 2013-04-14 Risk of Recidivism 4 Low 2013-04-14
4 5 marcu brown marcu brown 2013-01-13 Male 1993-01-21 23 Less than 25 African-American ... NaN NaN Risk of Violence 6 Medium 2013-01-13 Risk of Recidivism 8 High 2013-01-13
5 6 bouthy pierrelouis bouthy pierrelouis 2013-03-26 Male 1973-01-22 43 25 - 45 Other ... NaN NaN Risk of Violence 1 Low 2013-03-26 Risk of Recidivism 1 Low 2013-03-26
6 7 marsha miles marsha miles 2013-11-30 Male 1971-08-22 44 25 - 45 Other ... NaN NaN Risk of Violence 1 Low 2013-11-30 Risk of Recidivism 1 Low 2013-11-30
7 8 edward riddle edward riddle 2014-02-19 Male 1974-07-23 41 25 - 45 Caucasian ... NaN NaN Risk of Violence 2 Low 2014-02-19 Risk of Recidivism 6 Medium 2014-02-19
8 9 steven stewart steven stewart 2013-08-30 Male 1973-02-25 43 25 - 45 Other ... NaN NaN Risk of Violence 3 Low 2013-08-30 Risk of Recidivism 4 Low 2013-08-30
9 10 elizabeth thieme elizabeth thieme 2014-03-16 Female 1976-06-03 39 25 - 45 Caucasian ... NaN NaN Risk of Violence 1 Low 2014-03-16 Risk of Recidivism 1 Low 2014-03-16

10 rows × 47 columns

Attributes information¶
In [9]:
dt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11757 entries, 0 to 11756
Data columns (total 47 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       11757 non-null  int64  
 1   name                     11757 non-null  object 
 2   first                    11757 non-null  object 
 3   last                     11757 non-null  object 
 4   compas_screening_date    11757 non-null  object 
 5   sex                      11757 non-null  object 
 6   dob                      11757 non-null  object 
 7   age                      11757 non-null  int64  
 8   age_cat                  11757 non-null  object 
 9   race                     11757 non-null  object 
 10  juv_fel_count            11757 non-null  int64  
 11  decile_score             11757 non-null  int64  
 12  juv_misd_count           11757 non-null  int64  
 13  juv_other_count          11757 non-null  int64  
 14  priors_count             11757 non-null  int64  
 15  days_b_screening_arrest  10577 non-null  float64
 16  c_jail_in                10577 non-null  object 
 17  c_jail_out               10577 non-null  object 
 18  c_case_number            11015 non-null  object 
 19  c_offense_date           9157 non-null   object 
 20  c_arrest_date            1858 non-null   object 
 21  c_days_from_compas       11015 non-null  float64
 22  c_charge_degree          11757 non-null  object 
 23  c_charge_desc            11008 non-null  object 
 24  is_recid                 11757 non-null  int64  
 25  num_r_cases              0 non-null      float64
 26  r_case_number            3703 non-null   object 
 27  r_charge_degree          11757 non-null  object 
 28  r_days_from_arrest       2460 non-null   float64
 29  r_offense_date           3703 non-null   object 
 30  r_charge_desc            3643 non-null   object 
 31  r_jail_in                2460 non-null   object 
 32  r_jail_out               2460 non-null   object 
 33  is_violent_recid         11757 non-null  int64  
 34  num_vr_cases             0 non-null      float64
 35  vr_case_number           882 non-null    object 
 36  vr_charge_degree         882 non-null    object 
 37  vr_offense_date          882 non-null    object 
 38  vr_charge_desc           882 non-null    object 
 39  v_type_of_assessment     11757 non-null  object 
 40  v_decile_score           11757 non-null  int64  
 41  v_score_text             11752 non-null  object 
 42  v_screening_date         11757 non-null  object 
 43  type_of_assessment       11757 non-null  object 
 44  decile_score.1           11757 non-null  int64  
 45  score_text               11742 non-null  object 
 46  screening_date           11757 non-null  object 
dtypes: float64(5), int64(11), object(31)
memory usage: 4.2+ MB

Statistics on Attributes¶

In [10]:
dt.describe().T
Out[10]:
                           count         mean          std     min     25%     50%     75%      max
id                       11757.0  5879.000000  3394.097892     1.0  2940.0  5879.0  8818.0  11757.0
age                      11757.0    35.143319    12.022894    18.0    25.0    32.0    43.0     96.0
juv_fel_count            11757.0     0.061580     0.445328     0.0     0.0     0.0     0.0     20.0
decile_score             11757.0     4.371268     2.877598    -1.0     2.0     4.0     7.0     10.0
juv_misd_count           11757.0     0.076040     0.449757     0.0     0.0     0.0     0.0     13.0
juv_other_count          11757.0     0.093561     0.472003     0.0     0.0     0.0     0.0     17.0
priors_count             11757.0     3.082164     4.687410     0.0     0.0     1.0     4.0     43.0
days_b_screening_arrest  10577.0    -0.878037    72.889298  -597.0    -1.0    -1.0    -1.0   1057.0
c_days_from_compas       11015.0    63.587653   341.899711     0.0     1.0     1.0     2.0   9485.0
is_recid                 11757.0     0.253806     0.558324    -1.0     0.0     0.0     1.0      1.0
num_r_cases                  0.0          NaN          NaN     NaN     NaN     NaN     NaN      NaN
r_days_from_arrest        2460.0    20.410569    74.354840    -1.0     0.0     0.0     1.0    993.0
is_violent_recid         11757.0     0.075019     0.263433     0.0     0.0     0.0     0.0      1.0
num_vr_cases                 0.0          NaN          NaN     NaN     NaN     NaN     NaN      NaN
v_decile_score           11757.0     3.571489     2.500479    -1.0     1.0     3.0     5.0     10.0
decile_score.1           11757.0     4.371268     2.877598    -1.0     2.0     4.0     7.0     10.0

Null values¶

In [11]:
dt.isnull().sum()
Out[11]:
id                             0
name                           0
first                          0
last                           0
compas_screening_date          0
sex                            0
dob                            0
age                            0
age_cat                        0
race                           0
juv_fel_count                  0
decile_score                   0
juv_misd_count                 0
juv_other_count                0
priors_count                   0
days_b_screening_arrest     1180
c_jail_in                   1180
c_jail_out                  1180
c_case_number                742
c_offense_date              2600
c_arrest_date               9899
c_days_from_compas           742
c_charge_degree                0
c_charge_desc                749
is_recid                       0
num_r_cases                11757
r_case_number               8054
r_charge_degree                0
r_days_from_arrest          9297
r_offense_date              8054
r_charge_desc               8114
r_jail_in                   9297
r_jail_out                  9297
is_violent_recid               0
num_vr_cases               11757
vr_case_number             10875
vr_charge_degree           10875
vr_offense_date            10875
vr_charge_desc             10875
v_type_of_assessment           0
v_decile_score                 0
v_score_text                   5
v_screening_date               0
type_of_assessment             0
decile_score.1                 0
score_text                    15
screening_date                 0
dtype: int64

Plot null values¶

In [12]:
dt.isna().sum()[dt.isna().sum()>0].sort_values().plot(kind='bar', ylabel='count null values', figsize=(15,5))
Out[12]:
<AxesSubplot:ylabel='count null values'>
In [13]:
null_Values=dt.isnull().sum()/len(dt)
print(null_Values*100)
null_Values[null_Values>0].sort_values().plot(kind='bar', ylabel='percentage null values', figsize=(15,5))
id                           0.000000
name                         0.000000
first                        0.000000
last                         0.000000
compas_screening_date        0.000000
sex                          0.000000
dob                          0.000000
age                          0.000000
age_cat                      0.000000
race                         0.000000
juv_fel_count                0.000000
decile_score                 0.000000
juv_misd_count               0.000000
juv_other_count              0.000000
priors_count                 0.000000
days_b_screening_arrest     10.036574
c_jail_in                   10.036574
c_jail_out                  10.036574
c_case_number                6.311134
c_offense_date              22.114485
c_arrest_date               84.196649
c_days_from_compas           6.311134
c_charge_degree              0.000000
c_charge_desc                6.370673
is_recid                     0.000000
num_r_cases                100.000000
r_case_number               68.503870
r_charge_degree              0.000000
r_days_from_arrest          79.076295
r_offense_date              68.503870
r_charge_desc               69.014204
r_jail_in                   79.076295
r_jail_out                  79.076295
is_violent_recid             0.000000
num_vr_cases               100.000000
vr_case_number              92.498086
vr_charge_degree            92.498086
vr_offense_date             92.498086
vr_charge_desc              92.498086
v_type_of_assessment         0.000000
v_decile_score               0.000000
v_score_text                 0.042528
v_screening_date             0.000000
type_of_assessment           0.000000
decile_score.1               0.000000
score_text                   0.127584
screening_date               0.000000
dtype: float64
Out[13]:
<AxesSubplot:ylabel='percentage null values'>
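
The percentage of null values per column can also be computed in a single step, because the mean of a boolean mask is the fraction of its True entries. The snippet below is a minimal sketch on a small made-up frame: the name `toy` and its values are illustrative assumptions and are not taken from the COMPAS file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})

# isna() gives a boolean mask; its column-wise mean is the fraction of nulls.
null_pct = toy.isna().mean() * 100

# Keep only the columns that actually contain nulls, sorted by percentage.
null_pct_nonzero = null_pct[null_pct > 0].sort_values()
print(null_pct_nonzero)
```

On the real data the same two lines would replace the division by `len(dt)`; the plotting call stays unchanged.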

Building histograms¶

Numerical Attributes are:

  • id
  • age
  • juv_fel_count
  • decile_score
  • juv_misd_count
  • juv_other_count
  • priors_count
  • days_b_screening_arrest
  • c_days_from_compas
  • is_recid
  • num_r_cases
  • r_days_from_arrest
  • is_violent_recid
  • num_vr_cases
  • v_decile_score
  • decile_score.1

It is not useful to show the histograms of all these attributes. For example, id is just an incremental counter that identifies each record.
The output of describe() shows that the attributes num_r_cases and num_vr_cases are entirely null. Also, decile_score.1 is a duplicate of decile_score, and the attributes is_recid and is_violent_recid are our class labels, which we want to treat as boolean attributes.
So we want to plot only the following numerical attributes:

In [14]:
numericDT = dt[dt.columns.difference(
    ['id', 'name', 'first', 'last', 'dob', 'age_cat', 'race','compas_screening_date', 'num_r_cases', 'num_vr_cases', 'sex', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'r_case_number', 'r_charge_degree', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'v_type_of_assessment', 'v_score_text', 'v_screening_date', 'type_of_assessment', 'score_text', 'screening_date', 'decile_score.1'])]
numericDT.hist(figsize=(15,15))
plt.show()
In [15]:
numericDF = dt[dt.columns.difference(['id', 'name', 'first', 'last', 'dob', 'age_cat', 'race','compas_screening_date', 'num_r_cases', 'num_vr_cases', 'sex', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'r_case_number', 'r_charge_degree', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'v_type_of_assessment', 'v_score_text', 'v_screening_date', 'type_of_assessment', 'score_text', 'screening_date', 'decile_score.1'])]
numericAttributes =  numericDF.columns.difference(['is_recid', 'is_violent_recid', 'c_days_from_compas'])
for attribute in numericAttributes:
    sn.histplot(x = dt[attribute], hue = 'is_recid', data = numericDF, kde=True)
    plt.show()
In [16]:
numericDF = dt[dt.columns.difference(['id', 'name', 'first', 'last', 'dob', 'age_cat', 'race','compas_screening_date', 'num_r_cases', 'num_vr_cases', 'sex', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'r_case_number', 'r_charge_degree', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'v_type_of_assessment', 'v_score_text', 'v_screening_date', 'type_of_assessment', 'score_text', 'screening_date', 'decile_score.1'])]
numericAttributes =  numericDF.columns.difference(['is_recid', 'is_violent_recid', 'c_days_from_compas'])
for attribute in numericAttributes:
    sn.histplot(x = dt[attribute], hue = 'is_violent_recid', data = numericDF, kde=True)
    plt.show()
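
The long exclusion lists used above could also be avoided by selecting columns by dtype. The following is only a sketch of that alternative on a made-up frame (the name `toy` and its values are assumptions for illustration); it is not the selection used in the cells above.

```python
import pandas as pd

# Toy frame with an identifier, a numeric column, a text column,
# an entirely null column and a class label (values are made up).
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'age': [69, 31, 34],
    'sex': ['Male', 'Male', 'Female'],
    'num_r_cases': [float('nan')] * 3,
    'is_recid': [0, 1, 0],
})

# Keep numeric columns only.
numeric = toy.select_dtypes(include='number')

# Drop the columns that are entirely null, then the identifier.
numeric = numeric.dropna(axis=1, how='all').drop(columns=['id'])
print(numeric.columns.tolist())
```

Columns that are numeric but unwanted for other reasons (such as the duplicate decile_score.1) would still have to be dropped by name.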

Non Numeric Attributes are:

  • name
  • first
  • last
  • dob
  • age_cat
  • sex
  • race
  • compas_screening_date
  • c_jail_in
  • c_jail_out
  • c_case_number
  • c_offense_date
  • c_arrest_date
  • c_charge_degree
  • c_charge_desc
  • r_case_number
  • r_charge_degree
  • r_offense_date
  • r_charge_desc
  • r_jail_in
  • r_jail_out
  • vr_case_number
  • vr_charge_degree
  • vr_offense_date
  • vr_charge_desc
  • v_type_of_assessment
  • v_score_text
  • v_screening_date
  • type_of_assessment
  • score_text
  • screening_date

Most of them are text descriptions, like c_charge_desc, r_charge_desc, name, first and last, so they are not useful for our analysis. Other attributes are identification codes (such as id, c_charge_degree, vr_charge_degree) and some others have a single value (like type_of_assessment and v_type_of_assessment). The majority of the other non-numerical attributes are dates. The most relevant attributes are sex, age_cat and race, which we rename to ethnicity for moral reasons.

In [17]:
dt.rename(columns={'race': 'ethnicity'}, inplace=True)
dt['sex'] = dt['sex'].astype('category')
dt['ethnicity'] = dt['ethnicity'].astype('category')
dt['age_cat'] = dt['age_cat'].astype('category')
dt['score_text'] = dt['score_text'].astype('category')
dt['v_score_text'] = dt['v_score_text'].astype('category')
In [18]:
categoricalAttributes = ['sex', 'ethnicity', 'age_cat', 'score_text', 'v_score_text']
for attribute in categoricalAttributes:
    val = dt[attribute].value_counts()
    val.plot(kind = 'bar', figsize = (5, 5))
    plt.ylabel('count')
    plt.xlabel(attribute)
    plt.show()
In [19]:
categoricalDT = dt[dt.columns.difference(numericAttributes)]
catAttributes = categoricalDT.columns.difference(['is_recid'])
In [20]:
for attribute in categoricalAttributes:
    plt.figure(figsize = (10, 5))
    sn.countplot(x = dt[attribute], hue = 'is_recid', data = categoricalDT)
    plt.show()
In [21]:
categoricalDT = dt[dt.columns.difference(numericAttributes)]
catAttributes = categoricalDT.columns.difference(['is_violent_recid'])
In [22]:
for attribute in categoricalAttributes:
    plt.figure(figsize = (10, 5))
    sn.countplot(x = dt[attribute], hue = 'is_violent_recid', data = categoricalDT)
    plt.show()
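
The count plots show absolute frequencies, so groups of very different size are hard to compare. As a complementary view, a row-normalised contingency table gives the share of each class label inside every category. This is a minimal sketch on made-up values (the frame `toy` is an assumption for illustration and does not reproduce the real proportions).

```python
import pandas as pd

# Toy data: one categorical attribute and the class label (values are made up).
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male'],
    'is_recid': [1, 0, 0, 0, 1, 1],
})

# normalize='index' makes every row sum to 1, i.e. the label share per category.
rates = pd.crosstab(toy['sex'], toy['is_recid'], normalize='index')
print(rates)
```

The same call could be applied to each attribute in categoricalAttributes, with either is_recid or is_violent_recid as the second argument.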

Histograms and bar plots according to the class attributes¶

In [23]:
numericDT.plot(kind='box', subplots=True, sharex=False, sharey=False, figsize=(15, 27), layout=(5, 4))
plt.show()
In [24]:
sn.pairplot(numericDT, hue = 'is_recid')
plt.show()
In [25]:
sn.pairplot(numericDT, hue = 'is_violent_recid')
plt.show()
In [26]:
plt.figure(figsize=(15, 7))
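# Correlate the numeric columns only: object and category columns have no correlation.
sn.heatmap(dt.select_dtypes(include='number').corr(), annot=True, cmap='magma', fmt='.2f')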
plt.show()

Working on dates¶

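First we convert the date attributes to categorical attributes. Then, by applying describe() to their value counts, we inspect the variability of these attributes: the count is the number of distinct dates, and the other statistics summarise how many records share the same date. If the variability is low, it will be more useful to plot them.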

The attributes describing the dates are:

  • dob
  • compas_screening_date
  • c_jail_in
  • c_jail_out
  • c_offense_date
  • c_arrest_date
  • r_offense_date
  • r_jail_in
  • r_jail_out
  • vr_offense_date
  • v_screening_date
  • screening_date
In [27]:
# Convert every date column to a categorical attribute.
for column in ['dob', 'compas_screening_date', 'c_jail_in', 'c_jail_out', 'c_offense_date', 'c_arrest_date', 'r_offense_date', 'r_jail_in', 'r_jail_out', 'vr_offense_date', 'v_screening_date', 'screening_date']:
    dt[column] = dt[column].astype('category')
In [28]:
datesAttributes = ['dob', 'compas_screening_date', 'c_jail_in', 'c_jail_out', 'c_offense_date', 'c_arrest_date', 'r_offense_date', 'r_jail_in', 'r_jail_out', 'vr_offense_date', 'v_screening_date', 'screening_date']
for attribute in datesAttributes:
    val = dt[attribute].value_counts()
    print(val.describe())
    print("")
count    7800.000000
mean        1.507308
std         0.822150
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         6.000000
Name: dob, dtype: float64

count    704.000000
mean      16.700284
std        6.775800
min        1.000000
25%       12.000000
50%       16.000000
75%       21.000000
max       39.000000
Name: compas_screening_date, dtype: float64

count    10577.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
Name: c_jail_in, dtype: float64

count    10517.000000
mean         1.005705
std          0.090252
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          4.000000
Name: c_jail_out, dtype: float64

count    1036.000000
mean        8.838803
std         6.645770
min         1.000000
25%         1.000000
50%         9.000000
75%        14.000000
max        29.000000
Name: c_offense_date, dtype: float64

count    802.000000
mean       2.316708
std        1.641212
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max        9.000000
Name: c_arrest_date, dtype: float64

count    1090.000000
mean        3.397248
std         1.957536
min         1.000000
25%         2.000000
50%         3.000000
75%         4.000000
max        12.000000
Name: r_offense_date, dtype: float64

count    984.000000
mean       2.500000
std        1.493288
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max        9.000000
Name: r_jail_in, dtype: float64

count    953.000000
mean       2.581322
std        1.596328
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max       10.000000
Name: r_jail_out, dtype: float64

count    599.000000
mean       1.472454
std        0.777286
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max        6.000000
Name: vr_offense_date, dtype: float64

count    704.000000
mean      16.700284
std        6.775800
min        1.000000
25%       12.000000
50%       16.000000
75%       21.000000
max       39.000000
Name: v_screening_date, dtype: float64

count    704.000000
mean      16.700284
std        6.775800
min        1.000000
25%       12.000000
50%       16.000000
75%       21.000000
max       39.000000
Name: screening_date, dtype: float64
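
As an alternative way to look at the variability of a date attribute, the strings can be parsed into real dates, which gives direct access to the number of distinct days and to the covered time span. The snippet is a sketch on a made-up frame (the name `toy`, its values and the `summary` layout are assumptions), not a replacement for the cell above.

```python
import pandas as pd

# Toy frame with two date columns stored as strings (values are made up).
toy = pd.DataFrame({
    'screening_date': ['2013-08-14', '2013-08-14', '2014-12-31', '2013-01-27'],
    'vr_offense_date': [None, None, '2013-07-05', None],
})

summary = {}
for col in toy.columns:
    # Missing entries become NaT and are ignored by nunique/min/max.
    parsed = pd.to_datetime(toy[col], format='%Y-%m-%d')
    summary[col] = {
        'distinct': parsed.nunique(),
        'missing': int(parsed.isna().sum()),
        'first': parsed.min(),
        'last': parsed.max(),
    }

# One row per date attribute.
summary = pd.DataFrame(summary).T
print(summary)
```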

Data Preparation¶

The attributes that we want to include in our analysis are the following:

  • score_text and v_score_text, the categorical counterparts of the numerical decile_score and v_decile_score. They are important to show the correlation of the score with the class labels is_recid and is_violent_recid
  • sex, age_cat and race, renamed as ethnicity (for the same reasons explained before)
  • c_offense_date, r_offense_date and vr_offense_date, to find a way to predict how many days pass, on average, before a defendant becomes a recidivist.

We decided to keep these attributes as a consequence of the previous Data Understanding step. We decided to remove all the attributes with too high (typically IDs) or too low variability.

Select Data¶

In [29]:
# Drop every attribute that is not part of the analysis.
dt = dt[dt.columns.difference(['id', 'age', 'decile_score', 'v_decile_score', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest', 'c_days_from_compas', 'num_r_cases', 'r_days_from_arrest', 'num_vr_cases', 'name', 'first', 'last', 'dob', 'compas_screening_date', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'r_case_number', 'r_charge_degree', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'vr_case_number', 'vr_charge_degree', 'vr_charge_desc', 'v_type_of_assessment', 'v_screening_date', 'type_of_assessment', 'screening_date', 'decile_score.1'])]
In [30]:
dt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11757 entries, 0 to 11756
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   age_cat           11757 non-null  category
 1   c_offense_date    9157 non-null   category
 2   ethnicity         11757 non-null  category
 3   is_recid          11757 non-null  int64   
 4   is_violent_recid  11757 non-null  int64   
 5   r_offense_date    3703 non-null   category
 6   score_text        11742 non-null  category
 7   sex               11757 non-null  category
 8   v_score_text      11752 non-null  category
 9   vr_offense_date   882 non-null    category
dtypes: category(8), int64(2)
memory usage: 412.9 KB
In [31]:
dt.shape
Out[31]:
(11757, 10)

Clean Data¶

Transformation¶

For simplicity and readability, we decided to rename the values of the attribute age_cat in the following way:

  • 'Less than 25' becomes 'young';
  • '25 - 45' becomes 'adult';
  • 'Greater than 45' becomes 'senior'.
In [50]:
# Rename the categories directly: replace(..., inplace=True) on a .loc selection may act on a temporary copy.
dt['age_cat'] = dt['age_cat'].cat.rename_categories(
    {'Less than 25': 'young', '25 - 45': 'adult', 'Greater than 45': 'senior'})
In [51]:
val = dt['age_cat'].value_counts()
val
Out[51]:
adult     6272
senior    2467
young     2288
Name: age_cat, dtype: int64
Binarization¶

Since this is a binary classification problem, our class labels are is_recid and is_violent_recid, which are stored as integers. By plotting them in the previous steps, we saw that their possible values are 0 and 1, apart from the value -1 of is_recid, which is handled in the next step. So we decided to keep them without binarization.

Removing unknown values¶

We assume that the value -1 of the attribute is_recid marks unknown information, so we drop those records.

In [52]:
dt=dt[dt.is_recid!=-1]
In [53]:
val = dt['is_recid'].value_counts()
val
Out[53]:
0    7326
1    3701
Name: is_recid, dtype: int64
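
The filter above removes one specific placeholder value. A slightly more defensive variant keeps only the rows whose label is a valid 0/1 value, so any other unexpected code would be dropped as well. The snippet below is a sketch on made-up labels (the frame `toy` and the name `known` are assumptions); the final cast to boolean is optional and is not applied in this notebook.

```python
import pandas as pd

# Toy label column containing the -1 placeholder (values are made up).
toy = pd.DataFrame({'is_recid': [0, 1, -1, 1, 0]})

# Inspect the raw label values, including the placeholder.
print(toy['is_recid'].value_counts())

# Keep only the rows whose label is a valid 0/1 value.
known = toy[toy['is_recid'].isin([0, 1])].copy()

# Optional: store the label as a boolean attribute.
known['is_recid'] = known['is_recid'].astype(bool)
print(known['is_recid'].value_counts())
```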
Missing values¶
In [54]:
dt.isna().sum()
Out[54]:
age_cat                 0
c_offense_date       1880
ethnicity               0
is_recid                0
is_violent_recid        0
r_offense_date       7326
score_text              0
sex                     0
v_score_text            0
vr_offense_date     10145
dtype: int64
In [55]:
nullValues=dt.isnull().sum()/len(dt)
nullValues*100
Out[55]:
age_cat              0.000000
c_offense_date      17.049061
ethnicity            0.000000
is_recid             0.000000
is_violent_recid     0.000000
r_offense_date      66.436928
score_text           0.000000
sex                  0.000000
v_score_text         0.000000
vr_offense_date     92.001451
dtype: float64

We want to drop the rows with null score_text and v_score_text values.

In [56]:
dt = dt.dropna(subset=['score_text', 'v_score_text'])
In [57]:
dt.isna().sum()
Out[57]:
age_cat                 0
c_offense_date       1880
ethnicity               0
is_recid                0
is_violent_recid        0
r_offense_date       7326
score_text              0
sex                     0
v_score_text            0
vr_offense_date     10145
dtype: int64

For the modeling and prediction part of the main goal we will not keep the attributes about dates. We remove c_offense_date, r_offense_date and vr_offense_date.

In [58]:
# MAIN DATASET TO REACH THE PRIMARY GOAL (predict if a defendant becomes a recid)
dt_cleaned=dt[dt.columns.difference(['c_offense_date','is_violent_recid','r_offense_date','v_score_text','vr_offense_date'])]
dt_cleaned.info()
dt_cleaned.to_csv('dt_cleaned.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11027 entries, 0 to 11756
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   age_cat     11027 non-null  category
 1   ethnicity   11027 non-null  category
 2   is_recid    11027 non-null  int64   
 3   score_text  11027 non-null  category
 4   sex         11027 non-null  category
dtypes: category(4), int64(1)
memory usage: 216.0 KB
In [59]:
dt_cleaned.shape
Out[59]:
(11027, 5)
In [60]:
# SECOND DATASET TO REACH THE SECOND GOAL (predict if a recidivist defendant becomes a VIOLENT recidivist)
dt_cleaned_v=dt[dt.columns.difference(['c_offense_date','r_offense_date','score_text','vr_offense_date'])]
dt_cleaned_v=dt_cleaned_v[dt_cleaned_v.is_recid!=0]
dt_cleaned_v=dt_cleaned_v[dt_cleaned_v.columns.difference(['is_recid'])]
dt_cleaned_v.info()
dt_cleaned_v.to_csv('dt_cleaned_v.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3701 entries, 2 to 11753
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   age_cat           3701 non-null   category
 1   ethnicity         3701 non-null   category
 2   is_violent_recid  3701 non-null   int64   
 3   sex               3701 non-null   category
 4   v_score_text      3701 non-null   category
dtypes: category(4), int64(1)
memory usage: 72.9 KB
In [61]:
dt_cleaned_v.shape
Out[61]:
(3701, 5)

Work on new dataset¶

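We now need two more datasets to predict the number of days that pass before a defendant reoffends. We split the main dataset into the rows with is_recid=1 and the rows with is_violent_recid=1, which are the rows that carry a value for r_offense_date and vr_offense_date respectively. In both cases the target is the number of days between c_offense_date and the date of the new offense.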

In [62]:
from datetime import datetime
In [63]:
# DATASET TO predict how many days pass to become a recid
dt_date_r=dt[dt.columns.difference(['is_violent_recid','v_score_text','vr_offense_date'])]
dt_date_r=dt_date_r[dt_date_r.is_recid!=0]
dt_date_r=dt_date_r[dt_date_r.columns.difference(['is_recid'])]
dt_date_r=dt_date_r.drop(dt_date_r[dt_date_r['c_offense_date'].isna() | dt_date_r['r_offense_date'].isna()].index)

dates_diff=[]
for r, c in zip(dt_date_r['r_offense_date'], dt_date_r['c_offense_date']):
    dates_diff.append(((datetime.strptime(r, '%Y-%m-%d') - datetime.strptime(c, '%Y-%m-%d')).days))
dt_date_r['dates_diff_in_days']=dates_diff

dt_date_r=dt_date_r[dt_date_r.columns.difference(['c_offense_date','r_offense_date'])]
dt_date_r
Out[63]:
age_cat dates_diff_in_days ethnicity score_text sex
2 adult 160 African-American Low Male
3 young 64 African-American Low Male
7 adult 41 Caucasian Medium Male
12 young 736 Caucasian Low Male
14 young 128 African-American Medium Male
... ... ... ... ... ...
11736 senior 286 African-American Medium Male
11738 young 296 Caucasian Low Female
11746 young 9 African-American Low Male
11751 senior 30 African-American Low Male
11753 young 513 Caucasian Medium Male

3093 rows × 5 columns
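
The day difference computed in the loop above could also be obtained without an explicit loop. The following is only a sketch on made-up dates (the `toy` frame is an assumption, not taken from the dataset): it parses both columns with `pd.to_datetime` and subtracts them, then checks the result against the `strptime` loop used in this notebook.

```python
from datetime import datetime
import pandas as pd

# Hypothetical dates, chosen so the differences are easy to verify by hand
toy = pd.DataFrame({
    'c_offense_date': ['2013-01-01', '2013-01-01'],
    'r_offense_date': ['2013-01-11', '2014-01-01'],
})

# Loop version, same logic as in the cell above
loop_diff = [
    (datetime.strptime(r, '%Y-%m-%d') - datetime.strptime(c, '%Y-%m-%d')).days
    for r, c in zip(toy['r_offense_date'], toy['c_offense_date'])
]

# Vectorized version: parse once, subtract, read the day component
c_dates = pd.to_datetime(toy['c_offense_date'], format='%Y-%m-%d')
r_dates = pd.to_datetime(toy['r_offense_date'], format='%Y-%m-%d')
vec_diff = (r_dates - c_dates).dt.days

print(loop_diff)
print(vec_diff.tolist())
```

The two versions should agree row by row; the vectorized form mainly saves code and tends to be faster on larger frames.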

In [64]:
dt_date_r.info()
dt_date_r.to_csv('dt_date_r.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3093 entries, 2 to 11753
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   age_cat             3093 non-null   category
 1   dates_diff_in_days  3093 non-null   int64   
 2   ethnicity           3093 non-null   category
 3   score_text          3093 non-null   category
 4   sex                 3093 non-null   category
dtypes: category(4), int64(1)
memory usage: 61.0 KB
In [65]:
# DATASET TO predict how many days pass to become a violent recid
dt_date_v=dt[dt.columns.difference(['score_text','r_offense_date'])]
dt_date_v=dt_date_v[dt_date_v.is_recid!=0]
dt_date_v=dt_date_v[dt_date_v.is_violent_recid!=0]
dt_date_v=dt_date_v[dt_date_v.columns.difference(['is_recid', 'is_violent_recid'])]
dt_date_v=dt_date_v.drop(dt_date_v[dt_date_v['c_offense_date'].isna() | dt_date_v['vr_offense_date'].isna()].index)

dates_diff_v=[]
for vr, c in zip(dt_date_v['vr_offense_date'], dt_date_v['c_offense_date']):
    dates_diff_v.append(((datetime.strptime(vr, '%Y-%m-%d') - datetime.strptime(c, '%Y-%m-%d')).days))
dt_date_v['dates_diff_in_days']=dates_diff_v

dt_date_v=dt_date_v[dt_date_v.columns.difference(['c_offense_date','vr_offense_date'])]
dt_date_v
Out[65]:
age_cat dates_diff_in_days ethnicity sex v_score_text
2 adult 160 African-American Male Low
12 young 736 Caucasian Male Medium
22 adult 242 Caucasian Male Low
36 adult 659 African-American Male Low
39 adult 296 African-American Male Medium
... ... ... ... ... ...
11675 young 217 African-American Male Medium
11678 adult 252 African-American Male Low
11680 adult 926 African-American Male Medium
11683 senior 741 African-American Male High
11696 adult 337 Caucasian Male Low

728 rows × 5 columns
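
The two cells that build dt_date_r and dt_date_v follow the same steps: drop unused columns, keep the rows whose flags are non-zero, drop rows with missing dates, and replace the two dates with their difference in days. As a sketch only, these steps could be wrapped in one function. The name `days_to_event`, its parameters and the `toy` frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def days_to_event(df, flag_cols, start_col, event_col, drop_cols=()):
    """Hypothetical helper: filter on the flags, then turn two date
    columns into a single dates_diff_in_days column."""
    out = df.drop(columns=list(drop_cols))
    for flag in flag_cols:
        out = out[out[flag] != 0]          # keep rows where the flag is set
    out = out.dropna(subset=[start_col, event_col])
    start = pd.to_datetime(out[start_col], format='%Y-%m-%d')
    event = pd.to_datetime(out[event_col], format='%Y-%m-%d')
    out = out.assign(dates_diff_in_days=(event - start).dt.days)
    out = out.drop(columns=list(flag_cols) + [start_col, event_col])
    return out[sorted(out.columns)]        # alphabetical, as in the cells above

# Made-up rows, only to exercise the helper
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Male'],
    'is_recid': [1, 0, 1, 1],
    'is_violent_recid': [1, 0, 0, 1],
    'c_offense_date': ['2013-01-01', '2013-02-01', '2013-03-01', None],
    'r_offense_date': ['2013-01-31', None, '2013-03-11', '2014-01-01'],
    'vr_offense_date': ['2013-01-31', None, None, '2014-01-01'],
})

toy_r = days_to_event(toy, ['is_recid'], 'c_offense_date', 'r_offense_date',
                      drop_cols=['is_violent_recid', 'vr_offense_date'])
toy_v = days_to_event(toy, ['is_recid', 'is_violent_recid'],
                      'c_offense_date', 'vr_offense_date',
                      drop_cols=['r_offense_date'])
print(toy_r)
print(toy_v)
```

Keeping the filtering and the date arithmetic in one place would make it harder for the two datasets to drift apart if a step changes later.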

In [66]:
dt_date_v.info()
dt_date_v.to_csv('dt_date_v.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 728 entries, 2 to 11696
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   age_cat             728 non-null    category
 1   dates_diff_in_days  728 non-null    int64   
 2   ethnicity           728 non-null    category
 3   sex                 728 non-null    category
 4   v_score_text        728 non-null    category
dtypes: category(4), int64(1)
memory usage: 14.8 KB